Improving Gender Classification of Blog Authors

Authors

  • Arjun Mukherjee
  • Bing Liu
Abstract

The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features: variable-length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques significantly improve the classification accuracy of the current state-of-the-art methods.
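
As a rough illustration of what variable-length POS sequence features might look like, the sketch below mines contiguous POS tag sequences that occur in at least a minimum fraction of documents, then turns them into binary features. It is not the authors' algorithm, which the abstract does not specify: the contiguity restriction, the document-level support threshold, the level-wise pruning, the function names `mine_pos_patterns` and `featurize`, and the toy Penn Treebank-style tags are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Pattern = Tuple[str, ...]


def mine_pos_patterns(tagged_docs: Sequence[Sequence[str]],
                      min_support: float = 0.5,
                      max_len: int = 4) -> List[Pattern]:
    """Return contiguous POS sequences of length 1..max_len that occur in
    at least `min_support` (a fraction) of the documents.

    Assumption: a simplified, level-wise (Apriori-style) search, not the
    mining algorithm used in the paper.
    """
    n_docs = len(tagged_docs)
    frequent: List[Pattern] = []
    previous_level = None  # frequent patterns of the previous length

    for length in range(1, max_len + 1):
        doc_freq: Counter = Counter()
        for tags in tagged_docs:
            seen = set()
            for i in range(len(tags) - length + 1):
                gram = tuple(tags[i:i + length])
                # Prune: a sequence can only be frequent if both of its
                # (length - 1) sub-sequences were frequent.
                if previous_level is not None and (
                        gram[:-1] not in previous_level
                        or gram[1:] not in previous_level):
                    continue
                seen.add(gram)
            doc_freq.update(seen)  # count each pattern once per document

        level = {g for g, c in doc_freq.items() if c / n_docs >= min_support}
        if not level:
            break
        frequent.extend(sorted(level))
        previous_level = level

    return frequent


def featurize(tags: Sequence[str], patterns: Sequence[Pattern]) -> List[int]:
    """Binary vector: 1 if the pattern occurs contiguously in `tags`."""
    lengths = {len(p) for p in patterns}
    grams = {tuple(tags[i:i + n])
             for n in lengths
             for i in range(len(tags) - n + 1)}
    return [1 if p in grams else 0 for p in patterns]


# Toy, hand-tagged documents (hypothetical data, for illustration only).
docs = [
    ["PRP", "VBP", "DT", "NN"],
    ["PRP", "VBP", "JJ"],
    ["DT", "NN", "VBZ", "JJ"],
]
patterns = mine_pos_patterns(docs, min_support=0.6, max_len=3)
vector = featurize(["PRP", "VBP", "NN"], patterns)
```

In a full pipeline, vectors like `vector` would be concatenated with word and word-class features before feature selection and classifier training; the thresholds here are arbitrary and would need tuning on real data.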

Similar resources

Gender Classification with Deep Learning

For our project, we consider the task of classifying the gender of an author of a blog, novel, tweet, post or comment. Previous attempts have considered traditional NLP models such as bag of words and n-grams to capture gender differences in authorship, and apply it to a specific media (e.g. formal writing, books, tweets, or blogs). Our project takes a novel approach by applying deep learning m...

Gender-Specific English Language Use of Malaysian Blog Authors

Gender-based research on the language use in blogs has its roots in the long-standing notion that men and women speak and write differently. This paper reports an empirical study on the use of English in a blog context involving Malaysian blog authors. Specifically, the study aimed to identify gender-specific English use among Malaysian blog authors and determine the differences in the language...

Automatic Author Profiling Based on Linguistic and Stylistic Features Notebook for PAN at CLEF 2013

The rapid expansion of blog and electronic data in Web 2.0 is abounding and thus it is becoming important to identify the author's profile also. The problems of automatic identification of author's gender and age based on linguistic and stylistic pattern have been a subject of increasing research interest in the recent years. The research methodologies are also helpful for several other appli...

Predicting gender from blog posts

Blogs are informal, personal writings that people post on their own blog sites. Nowadays, blogging is an important online activity. People share blogs with their friends and family members. The topics of blog posting cover almost everything, ranging from personal life, political opinions, recipes, product reviews, or even just random rants. Although some bloggers review their biologically infor...

Gender Classification of Weblog Authors

In this paper, we present a Naïve Bayes classification approach to identify genders of weblog authors. In addition to features employed in traditional text categorization, we use weblog-specific features such as web page background colors and emoticons. Our results in progress, although preliminary, outperform the chosen baseline. They also suggest room for significant improvement once more adv...

Publication date: 2010